A scheduler apparatus provides bandwidth guarantees to individual data
packet flows as well as to aggregations of those flows (referred to as
"bundles") in a completely transparent manner, i.e., without using any
additional scheduling structure. For each bundle, the scheduler determines the
ratio between the bandwidth nominally allocated to the bundle and the sum of
the individual bandwidth allocations of the flows that are currently
backlogged in the bundle. The scheduler uses that ratio to modulate the
timestamp increments that regulate the distribution of bandwidth to the
individual flows. In this manner, the greater the ratio for that bundle, the
more bandwidth each backlogged flow in the bundle receives.

METHOD AND APPARATUS FOR HIERARCHICAL BANDWIDTH
DISTRIBUTION IN A PACKET NETWORK

RELATED APPLICATION 

This application is based on a provisional application, Serial No.
60/260807, filed on January 10, 2001, and entitled "Method and Apparatus for
Hierarchical Bandwidth Distribution in a Packet Network."

TECHNICAL FIELD OF THE INVENTION 

This invention relates to packet schedulers, and more particularly to a
packet-scheduling apparatus and method for guaranteeing data transfer rates to
individual data sources and to aggregations of those data sources.

BACKGROUND OF THE INVENTION 

The increasing popularity of elaborate Quality-of-Service (QoS)
frameworks such as Integrated Services [see reference 1 listed in the attached
Appendix] and Differentiated Services [2] puts emphasis on packet schedulers
that allow flexible bandwidth management. Several existing packet schedulers
offer excellent worst-case delay performance in addition to providing accurate
bandwidth guarantees [3, 4, 5, 6, 7, 8], but their cost is substantial [6, 7, 8]. In IP
networks, the enforcement of tight delay guarantees is still rather secondary to the
low-cost provision of robust bandwidth guarantees. For this reason, the industry
is showing considerable interest in Weighted Round Robin (WRR) schedulers [9,
10, 11], which do not necessarily achieve tight delay bounds, but do provide
robust bandwidth guarantees with minimal complexity. Different instances of
these schedulers have appeared in literature; well-known examples are the Deficit
Round Robin (DRR) algorithm [10] and the Surplus Round Robin (SRR)
algorithm [11].

The aforementioned instances of WRR schedulers successfully
differentiate the bandwidth guarantees of heterogeneous data packet flows.
However, with respect to the form in which they are currently defined, they are
not sufficient to satisfy all the bandwidth requirements of emerging
Quality-of-Service frameworks. Typically, flexible bandwidth management at the
network nodes requires the deployment of hierarchical scheduling structures, where
bandwidth can be allocated not only to individual flows, but also to aggregations
of those flows. With existing WRR schedulers, the superimposition of a
hierarchical structure for achieving bandwidth segregation compromises the
simplicity of the basic scheduler.

What is desired in the art of scheduling for data packet networks is an
improved scheduler apparatus that achieves hierarchical bandwidth segregation
without compromising the simplicity of the basic scheduler.

SUMMARY OF THE INVENTION 

It is an object of the present invention to define a scheduler apparatus for
hierarchical bandwidth segregation. The objective of the new technique is to
provide bandwidth guarantees to individual data packet flows as well as to
aggregations of those flows (referred to as "bundles") in a completely transparent
manner, i.e., without using any additional scheduling structure. For each bundle,
the scheduler determines the ratio between the bandwidth nominally allocated to
the bundle and the sum of the individual bandwidth allocations of the flows that
are currently backlogged in the bundle. The scheduler uses that ratio to modulate
the timestamp increments that regulate the distribution of bandwidth to the
individual flows. In this manner, the greater the ratio for that bundle, the more
bandwidth each backlogged flow in the bundle receives. The scheduler
always meets the bandwidth requirements of a given data packet flow as long as
the flow remains backlogged, and the bandwidth requirements of a given bundle
as long as there is at least one backlogged flow in the bundle.

More particularly, the scheduler apparatus of the present invention
schedules the transmission of data packets for a plurality of data packet flows,
said data packet flows being allocated given shares of the transmission capacity r
of a communication link and being grouped in bundles, said bundles being
allocated service shares of the processing capacity of said communication link,
the transmission over the communication link being divided in service frames, a
service frame offering at least one transmission opportunity to every data packet
flow that is backlogged, a backlogged data packet flow being a data packet flow
that has at least one data packet stored in a respective one of a plurality of packet
queues, the scheduling apparatus comprising: (1) means for determining the
duration of the service frame; and (2) means for guaranteeing that each data
packet flow always receives at least its allocated service share if it remains
continuously backlogged over a sufficient number of consecutive service frames,
and that each bundle receives at least its allocated service share if there is always
at least one data packet flow in the bundle that remains continuously backlogged
for the whole duration of a service frame over a sufficient number of consecutive
service frames, said guaranteeing means including: (A) means for maintaining, for
each bundle I, a cumulative share $\Phi_I$ that relates to the sum of said service shares
allocated to respective ones of said data packet flows that are grouped together in
the same bundle I; and (B) means for computing, for each bundle I, a service ratio
between the service share $R_I$ allocated to said bundle I and said cumulative share
$\Phi_I$ of the bundle; and (C) means for modulating said service shares allocated to
respective ones of said plurality of data packet flows using the service ratio
computed for respective ones of said plurality of bundles.

The present invention is also directed to a method for scheduling the
transmission of data packets for a plurality of data packet flows, said data packet
flows being allocated given shares of the transmission capacity of an outgoing
communication link and being grouped in a plurality of bundles, said bundles
being allocated service shares of the transmission capacity r of said outgoing
communication link, the transmission over the communication link being divided
in service frames, a service frame offering at least one transmission opportunity to
every data packet flow that is backlogged, a backlogged data packet flow being a
data packet flow that has at least one data packet stored in a respective one of a
plurality of packet queues, the method comprising the steps of: (1) determining
the duration of the service frame; (2) guaranteeing that each data packet flow
always receives at least its allocated service share if it remains continuously
backlogged over a sufficient number of consecutive service frames, and that each
bundle receives at least its allocated service share if there is always at least one
data packet flow in the bundle that remains continuously backlogged for the
whole duration of a service frame over a sufficient number of consecutive service
frames; (3) maintaining, for each bundle I, a cumulative share $\Phi_I$ that relates to
the sum of said service shares allocated to respective ones of said data packet
flows that are grouped together in the same bundle I; (4) computing, for each
bundle I, a service ratio between the service share $R_I$ allocated to said bundle I
and said cumulative share $\Phi_I$ of the bundle; and (5) modulating said service
shares allocated to respective ones of said plurality of data packet flows using the
service ratio computed for respective ones of said plurality of bundles.

BRIEF DESCRIPTION OF THE DRAWINGS 
In the drawings,

Fig. 1 shows an illustrative packet network including data sources,
communication switches, and data destinations.

Fig. 2 shows an illustrative communication switch used in the packet
network of Fig. 1.

Fig. 3A shows an example of pseudo-code used in a Deficit Round Robin
(DRR) algorithm when a new packet arrives at the head of a flow queue.

Fig. 3B shows an example of pseudo-code used in a Surplus Round Robin
(SRR) algorithm when the server completes the transmission of a packet.

Fig. 4 shows, in accordance with the present invention, a diagram
illustrating the two-layered logical organization of the scheduler.

Fig. 5 shows a functional diagram of the queues, state tables, registers, and
parameters utilized by the scheduler of the present invention.

Fig. 6 shows an illustrative block diagram of a particular implementation
of the apparatus of Fig. 5.

Figs. 7 and 7A to 7D show an illustrative flowchart describing a method of
scheduling the transmission of packets in accordance with the present invention.

Fig. 8 shows an example of pseudo-code used by the scheduler after
completing the transmission of a packet.

In the following description, identical element designations in different
figures represent identical elements. Additionally, in the element designations, the
first digit refers to the figure in which the designated element is first located (e.g.,
the element designated as 102 is first located in Fig. 1).

DETAILED DESCRIPTION

Figure 1 shows an illustrative packet network in which a plurality of
switches 101-1 through 101-p are connected to each other by communication
links. A number of data sources 102-1 through 102-q are connected to the
communication switches. A network connection is established from each of the
data sources to a corresponding destination 103-1 through 103-g, and data packets
are transmitted from each data source to the corresponding destinations.

Figure 2 shows an illustrative block diagram of the communication switch
101-1 of the packet network. As shown, the communication switch includes a
switch fabric 250 and a plurality of communication link interfaces 200-1 through
200-s. Each of the communication link interfaces connects a plurality of input
links to an output link and transfers data packets from the input links to the output
link. The communication switch 101-1 may contain just one or a plurality of such
communication link interfaces 200. For example, input communication link
interface 200-1 is located in front of the switch fabric 250, in which case its input
links 201-1 through 201-r are input links of the communication switch 101-1, and
its output link 203 connects to the switch fabric 250. As a second example,
output communication link interface 200-j is located at the output of the switch
fabric 250, where its input links may be a plurality of output links 204 of the
switch fabric 250, and its output link is an output link 202-j of the communication
switch 101-1. It should be noticed that packets received over a particular link or
over different links may or may not have the same length. For example, if the
switch fabric 250 is an Asynchronous Transfer Mode (ATM) switch and the
network of Fig. 1 is an ATM network, then all packets have the same length. In
the following description of the invention, the assumption is that packets received
over a particular link or over different links do not necessarily have the same length.

As will be discussed in a later paragraph with reference to Fig. 6, each of
the communication link interfaces 200 of Fig. 2 typically includes at least a packet
receiver, a scheduler, and a packet transmitter. As stated above, the scheduler
may be a Weighted Round Robin (WRR) scheduler [9, 10, 11], which in turn may
be implemented according to a Deficit Round Robin (DRR) [10] or a Surplus
Round Robin (SRR) [11] algorithm.

The Deficit Round Robin (DRR) algorithm is one of the most popular
instances of a WRR scheduler for variable-sized packets, due to its minimal
implementation complexity and its efficiency in servicing the flows in proportion
to their allocated service shares. Conforming to the WRR paradigm, the DRR
algorithm associates a service share $\rho_i$ with each configured flow i. The service
shares translate into minimum guaranteed service rates when their sum over all
configured flows does not exceed the capacity r of the server:

$$\sum_{i=1}^{V} \rho_i \le r \qquad (1)$$

The bound of eq. (1), where V is the total number of configured flows, guarantees
that flow i receives service at a long-term rate that is not lower than $\rho_i$.
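
As an illustration of eq. (1), the following minimal Python sketch checks whether a set of configured service shares fits within the server capacity. The function name and the sample values are hypothetical and are not part of the described apparatus.

```python
def shares_are_feasible(shares, capacity):
    """Return True when the sum of the service shares does not exceed r."""
    return sum(shares) <= capacity


# Hypothetical example: three flows on a server of capacity 100.
print(shares_are_feasible([20, 30, 40], 100))   # True
print(shares_are_feasible([20, 30, 60], 100))   # False
```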

The DRR algorithm divides the activity of the server into service frames.
The present invention refers to a formulation of the algorithm that uses a
reference timestamp increment $T_Q$ to express the frame duration in a virtual-time
domain. This formulation is functionally equivalent to the definition of DRR
originally presented in [10], but is better suited to the description of the invention.

Within a frame, each configured flow i is entitled to the transmission of a
quantum $Q_i$ of information units such that

$$Q_i = \rho_i \cdot T_Q \qquad (2)$$

The scheduler visits the backlogged flows only once per frame, and therefore
fulfills in a single shot their service expectations for the frame. Each flow i
maintains a queue of packets (flow queue), and a timestamp $F_i$ that is updated
every time a new packet $p_i^k$ of length $l_i^k$ reaches the head of the flow queue:

$$F_i^k = F_i^{k-1} + \frac{l_i^k}{\rho_i} \qquad (3)$$

The scheduler keeps servicing the flow as long as its timestamp remains smaller
than $T_Q$. When the timestamp exceeds the reference timestamp increment $T_Q$,
the scheduler declares the visit to flow i over, subtracts $T_Q$ from the timestamp of
flow i, and looks for another backlogged flow to serve. As a result, after
subtraction of $T_Q$, the value of $F_i$ expresses a service credit for flow i. In general,
the timestamps carry over the service credits of the backlogged flows to the
following frames, allowing the scheduler to distribute service proportionally to
the allocated service shares in the long term (i.e., over multiple frames).

When a flow i becomes idle, the scheduler immediately moves to another
flow to start giving it service. If flow i becomes backlogged again in a short time,
it must wait for the next frame to start in order to receive a new visit from the
server. When the flow becomes idle, its timestamp is reset to zero to avoid any
loss of service when the same flow becomes backlogged again in a future frame.
By construction, the timestamp of an idling flow is always smaller than $T_Q$, so that
the timestamp reset never generates extra service credits that would otherwise
penalize other flows.

By construction, at the beginning of a service frame the value of the
timestamp $F_i$ of flow i ranges between 0 and $L_i/\rho_i$, where $L_i$ is the maximum size
of a packet of flow i. The fluctuation of the initial value of the timestamp induces
the fluctuation of the amount of information units that flow i transmits in a frame,
which ranges within the interval $(Q_i - L_i, Q_i + L_i)$. Accordingly, the total amount
of information units that the server transmits in a frame is not fixed, even when all
configured flows are permanently backlogged.

The DRR scheduler was implemented in [10] with a single linked list of
backlogged flows, visited in FIFO order. The arrangement of the backlogged
flows in a single FIFO queue leads to O(1) implementation complexity, provided
that the reference timestamp increment $T_Q$ is not smaller than the timestamp
increment determined by the maximum-sized packet for the flow with minimum
service share:

$$T_Q \ge \frac{L_{max}}{\rho_{min}} \qquad (4)$$

If the condition of eq. (4) is not satisfied, the algorithmic complexity of the
scheduler explodes with the worst-case number of elementary operations to be
executed between consecutive packet transmissions (elementary operations
include: flow extraction and insertion in the linked list; timestamp update;
comparison of the timestamp with the reference timestamp increment). In fact,
the scheduler may have to deny service to a given flow for several consecutive
frames, until the repeated subtraction of the reference timestamp increment makes
the timestamp fall within the $[0, T_Q)$ interval. Shown in Fig. 3A is an illustrative
listing of pseudo-code that specifies the rules for handling flow i and updating its
timestamp in DRR when a new packet reaches the head of its queue.
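
The DRR rules described above can be sketched in Python as follows. This is only an illustrative reading of the text, not the pseudo-code of Fig. 3A; the names `Flow`, `charge_head`, and `drr_run`, as well as the sample traffic, are assumptions made for the example.

```python
from collections import deque


class Flow:
    """Hypothetical per-flow state for the sketch."""

    def __init__(self, share, packet_lengths):
        self.share = share                  # service share rho_i
        self.queue = deque(packet_lengths)  # packet lengths, head first
        self.timestamp = 0.0                # timestamp F_i
        self.sent = 0                       # information units transmitted


def charge_head(flow):
    """Update the timestamp when a new packet reaches the head of the queue."""
    flow.timestamp += flow.queue[0] / flow.share


def drr_run(flows, t_q, visits):
    """Visit the backlogged flows in FIFO order, for at most `visits` visits."""
    backlogged = deque(f for f in flows if f.queue)
    for flow in backlogged:
        charge_head(flow)
    for _ in range(visits):
        if not backlogged:
            break
        flow = backlogged.popleft()
        # Serve the flow while its timestamp stays below T_Q.
        while flow.queue and flow.timestamp < t_q:
            flow.sent += flow.queue.popleft()
            if flow.queue:
                charge_head(flow)
        if flow.queue:
            flow.timestamp -= t_q        # visit over: carry the balance over
            backlogged.append(flow)      # wait for the next frame
        else:
            flow.timestamp = 0.0         # idle flow: reset the timestamp


# Two permanently backlogged flows with shares 1 and 2 and 100-unit packets.
a, b = Flow(1, [100] * 100), Flow(2, [100] * 100)
drr_run([a, b], t_q=200, visits=20)
print(a.sent, b.sent)
```

In this run each flow is visited ten times. After the first visit, the flow with share 2 sends 400 units per visit and the flow with share 1 sends 200, which matches the ratio of the shares. A flow whose timestamp is not below `t_q` at the start of a visit sends nothing and only has `t_q` subtracted, which is the denial of service that eq. (4) is meant to rule out.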

A description of Surplus Round Robin (SRR) is provided in [11]. The
algorithm features the same parameters and variables as DRR, but a different
event triggers the update of the timestamp: a flow i receives a new timestamp
$F_i^k$ when the transmission of packet $p_i^k$ gets completed, independently of the
resulting backlog state of the flow. The end of the frame is always detected after the
transmission of a packet, and never before: the timestamp carries over to the next
frame the debit accumulated by the flow during the current frame, instead of the
credit that is typical of DRR.

An advantage of SRR over DRR is that it does not require knowing in
advance the length of the head-of-the-queue packet to determine the end of the
frame for a flow. Conversely, in order to prevent malicious flows from stealing
bandwidth from their competitors, the algorithm cannot reset the timestamp of a
flow that becomes idle. The non-null timestamp of an idle flow is eventually
obsoleted by the end of the same frame in which the flow becomes idle. Ideally,
the timestamp should be reset as soon as it becomes obsolete. However, in a
scheduler that handles hundreds of thousands or even millions of flows, a prompt
reset of all timestamps that can simultaneously become obsolete is practically
impossible. The present description of the invention focuses on implementations
of the SRR algorithm that do not perform any check for obsolescence on the
timestamps of the idle flows, and where a newly backlogged flow always resumes
its activity with the latest value of the timestamp, however old that value can be.
The effect of this assumption is that a newly backlogged flow may have to give
up part of its due service the first time it is visited by the server, in consequence of
a debit accumulated a long time before. Shown in Fig. 3B is an illustrative listing
of pseudo-code that specifies the rules for handling flow i and updating its
timestamp in SRR when the server completes the transmission of packet $p_i^k$.
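
For comparison with DRR, a single SRR visit can be sketched in Python as follows. This is a hedged illustration of the behavior described in the text, not the pseudo-code of Fig. 3B; the `Flow` dataclass, the function `srr_visit`, and the sample packet lengths are hypothetical.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Flow:
    share: float                                  # service share rho_i
    queue: deque = field(default_factory=deque)   # packet lengths, head first
    timestamp: float = 0.0                        # timestamp F_i


def srr_visit(flow, t_q):
    """Serve one flow until its frame ends or it becomes idle."""
    sent = []
    while flow.queue:
        length = flow.queue.popleft()             # transmit the head packet
        sent.append(length)
        flow.timestamp += length / flow.share     # update after transmission
        if flow.timestamp >= t_q:                 # end of frame detected
            flow.timestamp -= t_q                 # debit carried to next frame
            break
    return sent                                   # no reset when the flow idles


flow = Flow(share=1.0, queue=deque([150] * 6))
for _ in range(3):
    print(srr_visit(flow, t_q=200), flow.timestamp)
```

With these sample values the flow sends 300 units in its first frame and leaves a debit of 100, then sends 150 units in each of the next two frames, for a total of 600 units over three frames of quantum 200. Unlike the DRR case, the length of a packet is only needed once its transmission is complete.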

For simplicity of presentation, in the rest of this document the Weighted
Round Robin (WRR) name will be used to allude to DRR or SRR generically,
with no explicit reference to their distinguishing features.

WRR schedulers are essentially "single-layered," in that they can control
the distribution of bandwidth only to individual packet flows. Superimposing
multiple scheduling layers, and thus implementing a hierarchical and flexible
structure that can not only allocate bandwidth to individual flows, but also create
aggregations of flows and segregate bandwidth accordingly, compromises the
simplicity of these schedulers.

The reference model for bandwidth segregation is shown in Fig. 4. The
set of allocated flows (1,1) through (K,$V_K$) is partitioned into K subsets 401-1 through
401-K called bundles. Each bundle I aggregates $V_I$ flows and has an allocated
service rate $R_I$. The logical organization of the scheduler reflects a two-layered
hierarchy: it first distributes bandwidth to the bundles according to their aggregate
allocations, and then serves the flows based on their bandwidth allocations and on
the backlog state of the other flows in the respective bundles. The scheduler
treats each bundle independently of the backlog state of the corresponding flows,
as long as at least one of them is backlogged. The scheduler aims at the
enforcement of strict bandwidth guarantees for both the flow aggregates and the
individual flows within the aggregates, without trying to support delay guarantees
of any sort (frameworks for the provision of stringent delay guarantees in a
scheduling hierarchy are already available [7, 12], but they all resort to
sophisticated algorithms that considerably increase the complexity of the
scheduler). In order to support the bandwidth requirements of the flow
aggregates, the following condition must always hold on the rate allocations of
the bundles:

$$\sum_{I=1}^{K} R_I \le r \qquad (5)$$

Similarly, the following bound must be satisfied within each bundle I in order to
meet the bandwidth requirements of the associated flows:

$$\sum_{i \in I} \rho_i \le R_I \qquad (6)$$

A scheduling solution inspired by the frameworks presented in [7, 12]
would introduce a full-fledged (and expensive) scheduling layer to handle the
bundles in between the flows and the link server. Generally, the implementation
cost of a full-fledged hierarchical scheduler grows linearly with the number of
bundles, because each bundle requires a separate instance of the basic per-flow
scheduler. In the present invention, on the contrary, the layer that enforces the
bundle requirements in the scheduling hierarchy is purely virtual, and is
superimposed on a single instance of the basic scheduler. The cost of the
structure that handles the individual flows is therefore independent of the number
of configured bundles, which leads to substantial savings in the implementation of
the scheduling hierarchy.
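
The two admission conditions on the rate allocations (the bundle rates must fit within the link capacity r, and the flow shares of each bundle must fit within the bundle rate) can be checked with a few lines of Python. The sketch below is only an assumption about how such a check might be organized; the function name, the dictionary layout, and the sample values are hypothetical.

```python
def allocations_are_consistent(bundles, capacity):
    """Check the bundle-level and flow-level rate bounds.

    `bundles` maps a bundle name to a pair (R_I, [rho_i, ...]).
    """
    total_rate = sum(rate for rate, _ in bundles.values())
    if total_rate > capacity:              # the bundles must fit in the link
        return False
    # Within each bundle, the flow shares must fit in the bundle rate.
    return all(sum(shares) <= rate for rate, shares in bundles.values())


bundles = {"A": (60, [10, 20, 30]), "B": (40, [15, 15])}
print(allocations_are_consistent(bundles, capacity=100))   # True
```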




In accordance with the present invention, an enhanced WRR scheduler
segregates bandwidth hierarchically without requiring substantial modifications of
the basic structure of a WRR scheduler. According to the new technique, the
scheduler provides bandwidth guarantees to aggregations of flows (the bundles)
as well as to individual flows in a completely transparent manner, i.e., without
using any additional scheduling structure. The invention achieves this objective
by simply enhancing the way the scheduler manipulates the timestamps. The
resulting "soft" scheduling hierarchy has negligible complexity; yet, it is effective
in providing bandwidth guarantees to the bundles as well as to the individual
flows.

Figure 5 shows a functional diagram of the queues, state tables, registers,
and parameters utilized by the enhanced WRR scheduler for the soft enforcement
of hierarchical bandwidth segregation. With joint reference to Figs. 4 and 5, the
scheduler handles a plurality 501 of data packet flows i1 through jN, which are
grouped in a plurality of bundles 401-1 through 401-K. The flow queues 502
store data packets for respective ones of the data packet flows 501. The flow
queues 502 may be implemented as First-In-First-Out (FIFO) queues. Each of the
flow queues 502 has an associated per-flow state table 503, which stores several
variables for the corresponding flow 501. In the case of flow i1, for example, the
per-flow state table 503 includes a timestamp $F_{i1}$, a minimum guaranteed service
rate $\rho_{i1}$, a frame flag $FF_{i1}$, a pointer to the bundle I that includes the flow, and the
head and tail pointers of the associated flow queue 502 (not shown). The state of
the bundles is maintained in the per-bundle state tables 504. In the case of bundle
I, for example, the per-bundle state table 504 stores the aggregate bandwidth
allocation $R_I$, the running share $\phi_I$, the cumulative share $\Phi_I$, and the start flag $\sigma_I$.
A FIFO queue 505 of flow pointers determines the order by which the scheduler
visits the flows 501 to transmit their packets. The registers 506 store the head and
tail pointers of the FIFO queue of flow pointers 505. The scheduler maintains
global-state information in table 507. The table contains a frame counter
FRMCNT and a reference timestamp increment $T_Q$.
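
The state described with reference to Fig. 5 could be modeled in software along the following lines. This is a minimal sketch under stated assumptions: the class and field names are hypothetical, the flags are represented as integers because the text does not specify their encoding, and a `deque` stands in for each queue together with its head and tail pointers.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class BundleState:                 # per-bundle state table 504
    rate: float                    # aggregate bandwidth allocation R_I
    running_share: float = 0.0     # running share of the bundle
    cumulative_share: float = 0.0  # cumulative share of the bundle
    start_flag: int = 0            # start flag (encoding assumed)


@dataclass
class FlowState:                   # per-flow state table 503
    share: float                   # minimum guaranteed service rate rho_i
    bundle: BundleState            # pointer to the bundle of the flow
    timestamp: float = 0.0         # timestamp F_i
    frame_flag: int = 0            # frame flag FF_i (encoding assumed)
    queue: deque = field(default_factory=deque)   # flow queue 502


@dataclass
class GlobalState:                 # global-state table 507 plus FIFO 505
    t_q: float                     # reference timestamp increment T_Q
    frame_counter: int = 0         # FRMCNT
    flow_pointers: deque = field(default_factory=deque)


bundle = BundleState(rate=6.0)
flow = FlowState(share=2.0, bundle=bundle)
state = GlobalState(t_q=100.0)
flow.queue.append(1500)            # a packet of length 1500 arrives
state.flow_pointers.append(flow)   # the flow joins the FIFO of flow pointers
```

The sketch only fixes the layout of the tables. The rules for updating the shares, flags, and timestamps are given by the scheduler operation described in the rest of the text.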

Figure 6 shows an illustrative block diagram of an input communication
link interface 200 in which the scheduler may be utilized. The communication
link interface 200 includes a data packet receiver 601, a scheduler 602, and a
packet transmitter 609. Illustratively, the scheduler 602 is shown to include a
controller 603, a per-bundle-state RAM 604, and registers 605, all on the same
chip 606. A packet RAM 607 and a per-flow-state RAM 608 are shown as being
located on separate chips. Obviously, depending on the operating capacity and
other characteristics, the scheduler 602 may be implemented in other
configurations.

The controller 603 stores and runs the program that implements the
method of the present invention. An illustrative example of the program that
controls the operation of the communication link interface 200 is shown in
flowchart form in Figs. 7A-B. With joint reference to Figs. 5 and 6, the packets in the
flow queues 502 are stored in packet RAM 607; the per-flow state tables 503 are
stored in per-flow-state RAM 608; the head and tail pointers 506 of the FIFO
queue of flow pointers 505 and the global-state table 507 are stored in registers
605.

A brief overview of the operation of the scheduler 602 is as follows. The
packet receiver 601 receives from input links 201-1 through 201-r the data
packets of the data packet flows 501. Packet receiver 601 uses the contents of a
flow-identification field contained in the header of each packet (not shown) to
identify its respective data packet flow 501. The identification of the data packet
flow 501 leads to the identification of the associated flow queue 502 and bundle
401. The scheduler 602 uses the entries of the per-bundle state table 504 to
compute a service ratio for each bundle I. The service ratio of bundle I is defined
as the ratio between the nominal bandwidth allocation $R_I$ and the cumulative
share $\Phi_I$ of the bundle (the cumulative share accumulates the bandwidth
allocations of the backlogged flows of bundle I). By construction, the service
ratio of bundle I is never smaller than 1. The service ratio is involved in the
computation of the timestamp increments for the individual flows, where it
determines an increase in the amount of service that each backlogged flow of
bundle I receives during a frame when other flows of the same bundle are
detected to be idle at the beginning of the frame. The modulation of the
timestamp increments induced by the service ratio is such that the scheduler can
still guarantee both the aggregate bandwidth allocations of the bundles and the
individual bandwidth allocations of the flows.
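
As a small numerical illustration of the service ratio, the Python sketch below computes it for a bundle from the shares of its backlogged flows. The function name and the sample allocations are hypothetical.

```python
def service_ratio(bundle_rate, backlogged_shares):
    """Ratio between R_I and the cumulative share of the backlogged flows."""
    cumulative_share = sum(backlogged_shares)
    if cumulative_share <= 0:
        raise ValueError("the bundle has no backlogged flow")
    return bundle_rate / cumulative_share


# Hypothetical bundle with rate 6 and three flows with shares 1, 2, and 3.
print(service_ratio(6, [1, 2, 3]))   # 1.0 (all flows backlogged)
print(service_ratio(6, [1, 2]))      # 2.0 (the flow with share 3 is idle)
```

When the flow with share 3 is idle, the ratio doubles, so each of the two remaining flows is entitled to twice its nominal bandwidth and the bundle as a whole still receives its full allocation.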

DETAILED OPERATION

The enhanced WRR scheduler of the present invention achieves
hierarchical segregation of service bandwidth by superimposing a virtual
scheduling layer on a WRR scheduler of the prior art. The scheduler supports a
multitude of bundles, each bundle being an aggregate of data packet flows. In a
timestamp-based formulation of the underlying WRR scheduler (at this point of
the discussion, the distinction between DRR and SRR is not yet relevant), the
technique of the present invention relies on a simple modification of the
timestamp-updating rule to preserve the bandwidth guarantees of the individual
flows and at the same time satisfy the aggregate requirements of the bundles.

For each configured bundle I, the scheduler maintains a nominal
bandwidth allocation $R_I$ and a cumulative share $\Phi_I$. The nominal bandwidth
allocation $R_I$ is never smaller than the sum of the bandwidth allocations of all the
flows in the bundle. The cumulative share $\Phi_I$ tracks the sum of the bandwidth
allocations of the backlogged flows of bundle I, and is therefore never greater
than $R_I$:

$$\Phi_I = \sum_{i \in B_I} \rho_i \le R_I \qquad (7)$$

In eq. (7), $B_I$ is the set of flows of bundle I that are backlogged at the beginning
of the frame.

The scheduler uses the service ratio between the cumulative rate allocation
$R_I$ and the cumulative share $\Phi_I$ of bundle I to modulate the actual service
bandwidth that is granted to the flows of the bundle. More particularly, the
service ratio contributes to defining the timestamp increments for the flows of the
bundle. What follows is the timestamp update associated with packet $p_i^k$ of flow
i:

$$F_i^k = F_i^{k-1} + \frac{l_i^k}{\rho_i} \cdot \frac{\Phi_I}{R_I} \qquad (8)$$

Verifying that the timestamp assignment rule of eq. (8) actually enforces
the bandwidth guarantees of the bundles requires computing the amount of
service that the individual flows in bundle I may expect to receive during a frame.
In principle, when the service ratio $R_I/\Phi_I$ is greater than 1.0 for bundle I, the
timestamp assignment rule of eq. (8) results in the individual flows of bundle I
getting a service bandwidth that is $R_I/\Phi_I$ times higher than their nominal
bandwidth allocation. The computation is based on the following two
assumptions: (1) the cumulative share $\Phi_I$ of the bundle remains unchanged during
the whole frame, independently of the backlog dynamics of the flows in the
bundle; and (2) the set of flows that can access the server during the frame
includes only the flows that are backlogged at the beginning of the frame (if some
flows in the bundle become backlogged after the frame has started, they must wait
until the beginning of a new frame before they can access the server).
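
The effect of the modulated timestamp update of eq. (8) can be illustrated by comparing it with the plain WRR increment (packet length divided by service share). The Python sketch below is a hedged illustration; the function names and the sample values are hypothetical.

```python
def plain_increment(length, share):
    """Increment of a single-layered WRR scheduler: l / rho."""
    return length / share


def modulated_increment(length, share, cumulative_share, bundle_rate):
    """Increment of eq. (8): (l / rho) * (Phi_I / R_I)."""
    return plain_increment(length, share) * cumulative_share / bundle_rate


# Packet of 1000 units, flow share 2, bundle rate 6, cumulative share 3.
print(plain_increment(1000, 2))               # 500.0
print(modulated_increment(1000, 2, 3, 6))     # 250.0
```

With a service ratio of 2, the same packet costs the flow half the timestamp increment, so the flow can transmit twice as much before its timestamp reaches the reference increment for the frame.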

The reference timestamp increment $T_Q$, combined with eq. (8), sets the
reference amount of service $Q_i$ that flow i of bundle I expects to receive during a
frame:

$$Q_i = \rho_i \cdot \frac{R_I}{\Phi_I} \cdot T_Q \qquad (9)$$

Then, the aggregation of the service quanta of all the flows in bundle I leads to
the service quantum $Q_I$ of the bundle:

$$Q_I = \sum_{i \in B_I} Q_i = \frac{\sum_{i \in B_I} \rho_i}{\Phi_I} \cdot R_I \cdot T_Q = R_I \cdot T_Q \qquad (10)$$

The expression of $Q_I$ in eq. (10) is identical to the expression of the flow quantum
$Q_i$ in eq. (2), and therefore proves that the timestamp-updating rule of eq. (8)
preserves the bandwidth guarantees of bundle I, independently of the composition
of the set of flows that are backlogged in the bundle at the beginning of the frame.
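
The result of eq. (10) can be checked numerically with a short simulation of a single frame. The Python sketch below assumes an SRR-style update (the timestamp is charged after each transmission), fixed-size packets, and timestamps starting at zero; the function name and the sample values are hypothetical.

```python
def frame_service(shares, bundle_rate, t_q, packet_length):
    """Units sent in one frame by each backlogged flow of one bundle."""
    cumulative_share = sum(shares)              # frozen for the whole frame
    served = []
    for share in shares:
        timestamp, sent = 0.0, 0
        while timestamp < t_q:                  # flow is still in the frame
            sent += packet_length               # transmit one packet
            # Modulated timestamp increment of eq. (8).
            timestamp += (packet_length / share) * (cumulative_share / bundle_rate)
        served.append(sent)
    return served


# Bundle rate 6, reference increment 100, backlogged flows with shares 1 and 2.
print(frame_service([1, 2], bundle_rate=6, t_q=100, packet_length=10))   # [200, 400]
```

The two flows send 200 and 400 units, i.e., twice their nominal quanta of 100 and 200, and the bundle sends 600 units in total, which equals the bundle rate times the reference increment (6 x 100).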






Holding on the assumption that the cumulative share of bundle I does not
change during the frame, it can also be shown that the timestamp-updating rule of
eq. (8) preserves the service proportions for any two flows i, j of bundle I that
never become idle during the frame:

$$\frac{Q_i}{Q_j} = \frac{\rho_i}{\rho_j} \qquad (11)$$
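
The step behind eq. (11) can be written out explicitly. Assuming the per-flow quantum of eq. (9), with the same cumulative share $\Phi_I$ applied to both flows for the whole frame, the common factor cancels:

```latex
\frac{Q_i}{Q_j}
  = \frac{\rho_i \cdot \dfrac{R_I}{\Phi_I} \cdot T_Q}
         {\rho_j \cdot \dfrac{R_I}{\Phi_I} \cdot T_Q}
  = \frac{\rho_i}{\rho_j}
```

The cancellation relies on $\Phi_I$ being the same for both flows, which is why the cumulative share must stay constant during the frame.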



The specification of the details of the WRR algorithm with bandwidth
segregation requires a discussion of the assumptions that produce the results of
eqs. (10) and (11) and the evaluation of their algorithmic implications. The use of
a constant value of cumulative share $\Phi_I$ in all the timestamp increments that the
scheduler computes during a frame provides a common reference for consistently
distributing service to the flows of bundle I. The exclusion from the service frame
of the flows that become backlogged only after the frame has started serves the
same purpose. The timestamp increment can be viewed as the charge that the system
imposes on a flow for the transmission of the related packet. The cost of the
transmission depends on the bandwidth that is available within the bundle at the
time it is executed. In order to make the timestamp increment consistent with the
cost of the bandwidth resource within the bundle, it must be computed when the
resource is used, i.e., upon the transmission of the corresponding packet. If the
scheduler computes the increment in advance, the state of the bundle (and
therefore the actual cost of the bandwidth resource) can undergo radical changes
before the transmission of the packet occurs, thus making the charging
mechanism inconsistent with the distribution of bandwidth.

Within the pair of WRR algorithms under consideration, SRR is the one
that best fits the requirement for consistency between transmissions and
timestamp increments, because it uses the length of the just transmitted packet to
update the timestamp and determine the in-frame status of the corresponding
flow. In DRR, on the contrary, the scheduler performs the timestamp update and
the in-frame status check using the length of the new head-of-the-queue packet,
possibly long before it is actually transmitted. When the DRR server finally
delivers the packet, the cumulative share of the bundle, and therefore the cost of
bandwidth within the bundle, may have changed considerably since the latest
timestamp update.
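The effect of the update timing can be illustrated with a small numerical sketch. It assumes that the timestamp increment has the form (l / rho) * (Phi_I / R_I); the function name `increment` and all numeric values are hypothetical.

```python
def increment(length, rho, cumulative_share, bundle_share):
    """Timestamp increment charged for one packet (assumed form)."""
    return (length / rho) * (cumulative_share / bundle_share)

length, rho, R_I = 1000.0, 2.0, 8.0

# Cumulative share of the bundle when the packet reaches the head of the queue.
phi_at_head = 4.0
# Cumulative share when the packet is actually transmitted, in a later frame
# in which more flows of the bundle are backlogged.
phi_at_tx = 8.0

early = increment(length, rho, phi_at_head, R_I)  # charged in advance (DRR-like)
late = increment(length, rho, phi_at_tx, R_I)     # charged at transmission (SRR-like)

print(early, late)
```

In this sketch the charge computed in advance is half of the charge that reflects the state of the bundle at transmission time, which is the inconsistency that the SRR-based update avoids.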

Introducing the mechanism for bandwidth segregation in SRR is
straightforward. In addition to the minimum-bandwidth guarantee $R_I$ and the
cumulative share $\Phi_I$, each bundle $I$ maintains a running share $\phi_I$ and a start flag
$\sigma_I$. The running share keeps instantaneous track of the sum of the service shares
of the backlogged flows in the bundle:

$$\phi_I(t) = \sum_{i \in I:\; i \text{ backlogged at } t} \rho_i \qquad (12)$$

The running share is updated every time a flow of the bundle changes its backlog
state. In general, the updates of the running share $\phi_I$ do not translate into
immediate updates of the cumulative share $\Phi_I$. In fact, the scheduler updates the
cumulative share $\Phi_I$ of the bundle only upon detection of mismatching values in
the start flag $\sigma_I$ and in a global single-bit frame counter FRMCNT that the
scheduler toggles at every frame boundary (the scheduler compares $\sigma_I$ and
FRMCNT every time it starts servicing a flow of bundle $I$). A difference in the
two bits triggers the update of the cumulative share to be used in the future
timestamp computations ($\Phi_I \leftarrow \phi_I$) and toggles the start flag of the bundle
($\sigma_I \leftarrow$ FRMCNT). If, instead, the two bits are already equal, the service just
completed is certainly not the first one that the bundle receives during the current
frame, and no action must be taken on the bundle parameters. When the first flow
of a bundle becomes backlogged, the start flag is set equal to FRMCNT:

$$\sigma_I \leftarrow \mathit{FRMCNT} \qquad (13)$$
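The interplay of running share, cumulative share, and start flag can be sketched in a few lines. This is an illustrative sketch only; the class and method names (`Bundle`, `Scheduler`, `flow_becomes_backlogged`, and so on) are hypothetical and do not appear in the figures.

```python
from dataclasses import dataclass

@dataclass
class Bundle:
    running_share: float = 0.0     # sum of the shares of the backlogged flows
    cumulative_share: float = 0.0  # value frozen for the current frame
    start_flag: int = 0            # single bit compared with the frame counter

class Scheduler:
    def __init__(self):
        self.frmcnt = 0  # global single-bit frame counter

    def flow_becomes_backlogged(self, bundle, rho):
        if bundle.running_share == 0:
            # First backlogged flow of the bundle: align the start flag.
            bundle.start_flag = self.frmcnt
        bundle.running_share += rho

    def flow_becomes_idle(self, bundle, rho):
        bundle.running_share -= rho

    def flow_serviced(self, bundle):
        if bundle.start_flag != self.frmcnt:
            # First service of the bundle in this frame: refresh the
            # cumulative share and realign the start flag.
            bundle.cumulative_share = bundle.running_share
            bundle.start_flag = self.frmcnt

    def new_frame(self):
        self.frmcnt ^= 1  # toggled at every frame boundary

s, b = Scheduler(), Bundle()
s.flow_becomes_backlogged(b, 2.0)
s.new_frame()
s.flow_serviced(b)                 # cumulative share picks up 2.0
s.flow_becomes_backlogged(b, 3.0)  # running share changes mid-frame
s.flow_serviced(b)                 # cumulative share stays at 2.0
mid_frame = b.cumulative_share
s.new_frame()
s.flow_serviced(b)                 # cumulative share is refreshed
```

The running share follows every backlog change immediately, whereas the cumulative share changes at most once per frame.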

In order to identify the end of a frame, each flow $i$ maintains a frame flag
$FF_i$. The frame flag of flow $i$ is set to the complement of FRMCNT whenever the
flow is queued to the tail of the list of backlogged flows 505. When the scheduler
finds a frame flag that does not match the frame counter, it declares the start of a
new frame and toggles the frame counter. The sequence of operations to be
executed after completing the transmission of a packet is summarized in the
pseudo-code of Fig. 8.

Figures 7A-B depict in flow-chart form a method of operating the
scheduling apparatus of Fig. 6 for controlling the scheduling of packet
transmissions in accordance with the present invention. The flow chart of Figs.
7A-B and the pseudo-code of Fig. 8 are based on the assumption that SRR is the
underlying scheduling algorithm. As far as functionality is concerned, there is no
problem in using DRR instead of SRR. Similarly, the apparatus of Fig. 5
implements the soft scheduling hierarchy using a single FIFO queue of
backlogged flows. Any other queueing structure that allows a clear separation of
in-frame and out-of-frame flows could be used as well.

The following description makes reference to Figs. 4, 5, 6, and 7A-B. The
reference numbers to elements that first appear in Fig. 4 (5, 6) begin with a 4 (5,
6), while the steps of Figs. 7A-B are indicated by an S preceding the step number,
e.g., S510.

In Figure 7A, the controller 603 checks in S510 if there are newly received
data packets. If there are no newly received packets in S510, and there are
backlogged flows in S520, control passes to S680. If, instead, there are no newly
received data packets in S510 and there are no backlogged flows in S520, then the
controller 603 cycles between steps S510 and S520 until there are new packets
received. When the presence of newly received packets is detected at receiver
601 in S510, the controller 603 selects one of the packets in S550. Then, the
controller 603 identifies the flow of the data packet in S560, and finally stores the
packet in the appropriate flow queue 502 (in S570). If the queue length for the
identified flow is not zero in S580, the queue length for that flow is incremented
in S970 and control passes to S680. If, instead, the queue length for the identified
flow is zero in S580, the frame flag of the flow (in per-flow state table 503) is set
to the complement of FRMCNT (in global-state table 507) in S585; then, the
controller increases the total number of backlogged flows in S590 and the queue
length for the identified flow in S600. The bundle of the identified flow is
identified in S610 using the bundle pointer of per-flow state table 503. In S620, it
is determined if the running share of the bundle is equal to zero. If the running
share of the identified bundle is null, in S630 the controller 603 sets the start flag
of the bundle (stored in per-bundle state table 504) to the value of the global
frame counter FRMCNT. Then, control passes to S635. If, instead, the running
share of the bundle is not zero in S620, control passes directly from S620 to S635.
In S635, the running share of the identified bundle is increased by the bandwidth
allocation of the newly-backlogged flow. Then, in S640, the flow is appended to
the tail of the FIFO queue of flow pointers 505, and, after that, control passes to
S680.
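The packet-arrival path described above can be sketched as follows. The sketch is illustrative only: the class names, the `enqueue` function, and the use of `collections.deque` for the flow queue and for the FIFO queue of flow pointers are assumptions, and the step labels in the comments are an approximate mapping onto the flow chart.

```python
from collections import deque

class Bundle:
    def __init__(self):
        self.running_share = 0  # sum of shares of the backlogged flows
        self.start_flag = 0     # single bit

class Flow:
    def __init__(self, rho, bundle):
        self.rho = rho          # bandwidth allocation of the flow
        self.bundle = bundle    # bundle pointer
        self.queue = deque()    # flow queue of packets
        self.frame_flag = 0     # single bit

class State:
    def __init__(self):
        self.frmcnt = 0            # global frame counter
        self.backlogged = deque()  # FIFO queue of flow pointers

def enqueue(state, flow, packet):
    """Store a newly received packet and update the backlog state."""
    was_empty = not flow.queue
    flow.queue.append(packet)             # S570; the queue length grows by one
    if not was_empty:
        return                            # flow was already backlogged
    flow.frame_flag = state.frmcnt ^ 1    # S585: complement of FRMCNT
    bundle = flow.bundle                  # S610
    if bundle.running_share == 0:         # S620
        bundle.start_flag = state.frmcnt  # S630
    bundle.running_share += flow.rho      # S635
    state.backlogged.append(flow)         # S640: tail of the FIFO queue

state, bundle = State(), Bundle()
flow = Flow(rho=3, bundle=bundle)
enqueue(state, flow, "p1")
enqueue(state, flow, "p2")
```

The total number of backlogged flows is not kept as a separate counter here, since it is simply the length of `state.backlogged`.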

In S680, if the transmitter 609 is busy in the transmission of an old packet
and is therefore not available for the transmission of a new packet, control returns
to S510; otherwise, the availability of a just serviced flow waiting for post-service
processing is checked in S700. If a just serviced flow is not available in S700, it
is determined in S710 if there are any backlogged flows. If no backlogged flows
exist, control returns to S510; otherwise, the flow 501 at the head 506 of the FIFO
queue of flow pointers 505 is selected for service in S720 and the first data packet
in the flow queue 502 of the selected flow is sent to transmitter 609 in S730. In
S740, if the frame flag of the flow 501 selected for service is equal to the global
frame counter FRMCNT, then control returns to S510; otherwise the global frame
counter FRMCNT is toggled in S750 and control returns to S510.

In S700, if a just serviced flow 501 waiting for post-service processing is
available, the controller moves to S760, where it decrements the length of the
corresponding flow queue 502. The bundle 401 of the just serviced flow 501 is
identified in S770 using the bundle pointer of per-flow state table 503. In S780 it
is determined if the global frame counter FRMCNT is equal to the start flag of the
bundle 401 of the just serviced flow 501. If the start flag is not equal to the global
frame counter, the controller sets the cumulative share of the bundle equal to the
running share of the bundle (in S790) and the start flag of the bundle equal to the
global frame counter (in S800). Control then passes to S810 for the update of the
flow timestamp. If, in S780, the start flag of the bundle of the just serviced flow
is equal to the global frame counter, control passes directly to S810 for the update
of the timestamp of the just serviced flow.

In S820 the controller 603 determines if the queue length of the just
serviced flow is equal to zero. If the flow queue 502 of the serviced flow 501 is
determined to be empty in S820, the controller checks in S880 if the timestamp of
the just serviced flow is greater than or equal to the reference timestamp
increment of global-state table 507. If the timestamp is greater than or equal to
the reference timestamp increment, control passes to S890, where the timestamp
is reset within the valid range (0, $T_Q$). Control then passes to S900. If, in S880,
the timestamp of the just serviced flow is determined to be smaller than the
reference timestamp increment, control passes directly to S900. In S900, the
pointer to the just serviced flow is extracted from the head 506 of the FIFO queue
of flow pointers 505. Then, in S910, the running share of the bundle of the just
serviced flow is decreased by the bandwidth allocation of the just serviced flow
that just became idle. In S920, the controller 603 decrements the total number of
backlogged flows, and then moves to S710.

If, in S820, the flow queue 502 of the serviced flow 501 is determined to
be non-empty, the controller checks in S830 if the timestamp of the just serviced
flow is greater than or equal to the reference timestamp increment of global-state
table 507. If the timestamp is greater than or equal to the reference timestamp
increment, control passes to S840, where the frame flag of the just serviced flow is
toggled. Then, in S850, the timestamp of the just serviced flow is reset within the
valid range (0, $T_Q$). In S860, the controller 603 extracts the pointer to the just
serviced flow from the head 506 of the FIFO queue of flow pointers 505. The same
pointer is queued back to the tail 506 of the FIFO queue 505 in S870. Control
then passes to S710. In S830, if the timestamp of the just serviced flow is
determined to be smaller than the reference timestamp increment,
control passes directly to S710.

The listing of Fig. 8 describes in pseudo-code form a method for selecting
and post-service processing a flow $i$ of bundle $I$ in accordance with the present
invention.
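The selection and post-service processing can be sketched in executable form as follows. This is a minimal sketch under stated assumptions, not the listing of Fig. 8 itself: the class and attribute names are hypothetical, the timestamp increment is assumed to be (l / rho) * (Phi_I / R_I), a flow's frame is treated as over when its timestamp reaches T_Q, and packets are represented only by their lengths.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Bundle:
    R: float                       # service share allocated to the bundle
    running_share: float = 0.0
    cumulative_share: float = 0.0
    start_flag: int = 0

@dataclass
class Flow:
    rho: float                     # service share allocated to the flow
    bundle: Bundle
    queue: deque = field(default_factory=deque)  # packet lengths
    timestamp: float = 0.0
    frame_flag: int = 0

class Scheduler:
    def __init__(self, T_Q):
        self.T_Q = T_Q             # reference timestamp increment
        self.frmcnt = 0            # global single-bit frame counter
        self.backlogged = deque()  # FIFO queue of backlogged flows

    def serve_head(self):
        flow = self.backlogged[0]
        bundle = flow.bundle
        if flow.frame_flag != self.frmcnt:
            self.frmcnt ^= 1                   # a new frame starts
        length = flow.queue.popleft()          # transmit head-of-queue packet
        if bundle.start_flag != self.frmcnt:
            # First service of the bundle in this frame.
            bundle.cumulative_share = bundle.running_share
            bundle.start_flag = self.frmcnt
        # Assumed form of the ratio-modulated timestamp increment.
        flow.timestamp += (length / flow.rho) * (bundle.cumulative_share / bundle.R)
        if flow.timestamp >= self.T_Q:         # frame over for this flow
            flow.timestamp -= self.T_Q
            flow.frame_flag = self.frmcnt ^ 1
            self.backlogged.popleft()
            if flow.queue:
                self.backlogged.append(flow)   # still backlogged: requeue
            else:
                bundle.running_share -= flow.rho
        elif not flow.queue:                   # flow is getting idle
            self.backlogged.popleft()
            bundle.running_share -= flow.rho
        return length

# Two flows of one bundle, already backlogged before the first frame.
sched = Scheduler(T_Q=10.0)
bundle = Bundle(R=4.0, running_share=2.0)
a = Flow(rho=1.0, bundle=bundle, queue=deque([20, 20]), frame_flag=1)
b = Flow(rho=1.0, bundle=bundle, queue=deque([20]), frame_flag=1)
sched.backlogged.extend([a, b])
while sched.backlogged:
    sched.serve_head()
```

In the usage example, flow `a` is charged with a cumulative share of 2 for its first packet and with a cumulative share of 1 for its second packet, because flow `b` has become idle before the second frame starts.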

The illustrative embodiments described above are but exemplary of the
principles that may be used to superimpose a soft scheduling hierarchy on a
Weighted Round Robin scheduler to achieve hierarchical bandwidth segregation
in accordance with the present invention. Those skilled in the art will be able to
devise numerous arrangements which, although not explicitly shown or described
herein, nevertheless embody those principles that are within the spirit and scope
of the present invention as defined by the claims appended hereto.

CLAIMS:

1. An apparatus for scheduling the transmission of data packets for a
plurality of data packet flows, said data packet flows being allocated given shares
of the transmission capacity $r$ of a communication link and being grouped in
bundles, said bundles being allocated service shares of the processing capacity of
said communication link, the transmission over the communication link being
divided in service frames, a service frame offering at least one transmission
opportunity to every data packet flow that is backlogged, a backlogged data
packet flow being a data packet flow that has at least one data packet stored in
respective one of a plurality of packet queues, the scheduling apparatus
comprising:

means for determining the duration of the service frame; and

means for guaranteeing that each data packet flow always receives at least
its allocated service share if it remains continuously backlogged over a sufficient
number of consecutive service frames, and that each bundle receives at least its
allocated service share if there is always at least one data packet flow in the
bundle that remains continuously backlogged for the whole duration of a service
frame over a sufficient number of consecutive service frames, said guaranteeing
means including:

means for maintaining, for each bundle $I$, a cumulative share $\Phi_I$ that
relates to the sum of said service shares allocated to respective ones of said data
packet flows that are grouped together in the same bundle $I$;






CA 02366781 2002-01-08 



F. M. Chiussi 23.1-2-11-1 



means for computing, for each bundle $I$, a service ratio between the
service share allocated to said bundle $I$ and said cumulative share $\Phi_I$ of the
bundle; and

means for modulating said service shares allocated to respective ones of
said plurality of data packet flows using the service ratio computed for respective
ones of said plurality of bundles.

2. The scheduling apparatus of claim 1, wherein an algorithm selected
from a group including a Weighted Round Robin (WRR) algorithm, a Deficit
Round Robin (DRR) algorithm, or a Surplus Round Robin (SRR) algorithm is
used to schedule the transmission of data packets.

3. The scheduling apparatus of claim 1, wherein said means for
determining the duration of a service frame include:

a global frame counter FRMCNT;

a start flag $\sigma_I$ for each bundle $I$ of said plurality of bundles; and

a frame flag $FF_i$ for each data packet flow $i$ of said plurality of data
packet flows.

4. The scheduling apparatus of claim 3, wherein the start flag $\sigma_I$ of
bundle $I$ is set equal to the global frame counter FRMCNT when the first data
packet flow in the bundle becomes backlogged.

5. The scheduling apparatus of claim 3, wherein the frame flag $FF_i$ of
data packet flow $i$ is set to a different value than the global frame counter
FRMCNT when the flow becomes backlogged or is processed for the last time in
the current service frame.

6. The scheduling apparatus of claim 3, wherein the end of a service frame
and the start of the following one are simultaneously detected when the frame
flag $FF_i$ of the next data packet flow $i$ to be processed has different value than the
global frame counter FRMCNT.

7. The scheduling apparatus of claim 1, wherein the value of the
cumulative share $\Phi_I$ of bundle $I$ is equal to the sum of the service shares of the
data packet flows of bundle $I$ that are backlogged.

8. The scheduling apparatus of claim 1, wherein the value of the
cumulative share of bundle $I$ is set when a first data packet flow of the bundle
is first serviced in a service frame, and kept unchanged for the whole duration of
the same service frame, even if the backlog state of one or a plurality of data
packet flows of bundle $I$ changes during the service frame.

9. A method for scheduling the transmission of data packets for a
plurality of data packet flows, said data packet flows being allocated given shares
of the transmission capacity of an outgoing communication link and being
grouped in a plurality of bundles, said bundles being allocated service shares of
the transmission capacity $r$ of said outgoing communication link, the transmission
over the communication link being divided in service frames, a service frame
offering at least one transmission opportunity to every data packet flow that is
backlogged, a backlogged data packet flow being a data packet flow that has at
least one data packet stored in respective one of a plurality of packet queues, the
method comprising the steps of:

determining the duration of the service frame;

guaranteeing that each data packet flow always receives at least its
allocated service share if it remains continuously backlogged over a sufficient
number of consecutive service frames, and that each bundle receives at least its
allocated service share if there is always at least one data packet flow in the
bundle that remains continuously backlogged for the whole duration of a service
frame over a sufficient number of consecutive service frames;

maintaining, for each bundle $I$, a cumulative share that relates to the
sum of said service shares allocated to respective ones of said data packet flows
that are grouped together in the same bundle $I$;

computing, for each bundle $I$, a service ratio between the service share $R_I$
allocated to said bundle $I$ and said cumulative share of the bundle; and

modulating said service shares allocated to respective ones of said
plurality of data packet flows using the service ratio computed for respective ones
of said plurality of bundles.

FIG. 3A

    if (flow i is newly backlogged)
        F_i ← l_i / ρ_i
        Append i to the tail of the linked list
    else /* A packet of i has just been transmitted */
        F_i ← F_i + l_i / ρ_i
        if (F_i > T_Q)
            F_i ← F_i − T_Q
            Conclude visit to flow i
        else
            Keep servicing flow i

FIG. 3B

    F_i ← F_i + l_i / ρ_i
    if (F_i ≥ T_Q)
        F_i ← F_i − T_Q
        Conclude visit to flow i
    else if (flow i is still backlogged)
        Keep servicing flow i

FIG. 8



    Identify flow i currently at the head of the linked list
    Identify bundle I of flow i
    if (FF_i ≠ FRMCNT)
        FRMCNT ← ¬FRMCNT
    Prepare head-of-the-queue packet for transmission
    if (σ_I ≠ FRMCNT)
        Φ_I ← φ_I
        σ_I ← FRMCNT
    F_i ← F_i + (l_i / ρ_i) · (Φ_I / R_I)
    if (F_i ≥ T_Q) /* Frame over for flow i */
        FF_i ← ¬FRMCNT
        F_i ← F_i − T_Q
        Extract flow i from head of linked list
        if (flow i is still backlogged)
            Append flow i to tail of linked list
        else /* Flow i is getting idle */
            φ_I ← φ_I − ρ_i
    else if (flow i is getting idle)
        Extract flow i from head of linked list
        φ_I ← φ_I − ρ_i